Corpus and Text Analysis of Spontaneous Japanese

نویسنده

  • Hitoshi Isahara
چکیده

There are three major parts of the “Spontaneous Speech: Corpus and Processing Technology” project; (1) compilation of large spontaneous speech corpus, (2) establishment of spoken language engineering based on the corpus, and (3) developing a prototype of a spoken language summarization system. This paper describes how we help to develop this large corpus, i.e., (1), using technology developed as a part of (2). Firstly, we discuss how to annotate whole corpus morphologically. Secondly, we explain how we annotate sentence boundaries. And thirdly we discuss discourse annotation for CSJ. This paper describes overviews of these works and details of the works described in this paper are explained in the other papers in this volume.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Dependency-structure Annotation to Corpus of Spontaneous Japanese

In Japanese, syntactic structure of a sentence is generally represented by the relationship between phrasal units, or bunsetsus in Japanese, based on a dependency grammar. In the same way, the syntactic structure of a sentence in a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ), is represented by dependency relationships between bunsetsus. This paper descri...

متن کامل

Progress of Speech Recognition using the Corpus of Spontaneous Japanese (CSJ)

The report gives an overview of the current state of spontaneous speech recognition using the “Corpus of Spontaneous Japanese (CSJ)”. It is shown that the large-scale corpus had strong impact in training acoustic and language models considering morphological and pronunciation variations which are characteristic to spontaneous Japanese. Unsupervised adaptation of these models and the speaking ra...

متن کامل

Spontaneous Speech Corpus of Japanese

Design issues of a spontaneous speech corpus is described. The corpus under compilation will contain 800-1000 hour spontaneously uttered Common Japanese speech and the morphologically annotated transcriptions. Also, segmental and intonation labeling will be provided for a subset of the corpus. The primary application domain of the corpus is speech recognition of spontaneous speech, but we plan ...

متن کامل

Pronunciation variant analysis using speaking style parallel corpus

To improve the recognition accuracy for spontaneous conversational speech, we collected a corpus to study how spontaneous conversational speech differs from read style speech. The corpus consists of two parts: 1) spontaneous conversational speech and 2) read speech with the same word transcriptions as the conversational speech. In word and phone recognition experiments, it was confirmed that, f...

متن کامل

Morphological Annotation of a Large Spontaneous Speech Corpus in Japanese

We propose an efficient framework for humanaided morphological annotation of a large spontaneous speech corpus such as the Corpus of Spontaneous Japanese. In this framework, even when word units have several definitions in a given corpus, and not all words are found in a dictionary or in a training corpus, we can morphologically analyze the given corpus with high accuracy and low labor costs by...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2003